# Creating a Standardized FIF Dataset using EEGUnity
## Prerequisites
Before using this script, you should have:
1. **Toolbox installation**  
   - Install the EEGUnity toolbox.

2. **Familiarity with Python syntax**  
   - Basic understanding of Python programming concepts and structures.

3. **Knowledge of the MNE-Python library (suggested)**  
   - Experience using MNE-Python for EEG data processing and analysis.

4. **Understanding of EEG paradigms (suggested)**  
   - Familiarity with common EEG experimental paradigms and related concepts.

5. **Time required**  
   - About 30 minutes.

## Purpose 
This tutorial shows developers how to create a `make_fif_xxxx` script (referred to below as the `standard script`) using the `EEGUnity` library. The script converts raw EEG datasets into the FIF format so that any later processing can work on a unified representation. It ensures that: 
- **Events** are stored as MNE annotations, with descriptions consistent with the original dataset. 
- **Subject Information** is embedded in the `info['description']` field of the `mne.io.Raw` object. 
- **Channel Information** is updated if the original data does not sufficiently reflect electrode positions. 

This document is detailed and beginner-friendly, so it should remain accessible to readers with little prior experience in EEG or programming. The steps below describe how to create the `standard script`.
## 🚀 Step 0: Prepare a Python Environment

1. **Create a Python environment** with Python version **≥ 3.6**.
   - You can use `venv` or `conda` to manage the environment.
   
2. **Install the `EEGUnity` library** by running the following command in your terminal:
   ```bash
   pip install eegunity
   ```

## Step 1: Prepare a Python Project
1. Create a Python project and name it `standard_script_projects`.
2. Create a Python package named `setting` in the root directory of the project, and create a Python file named `path_variable.py` inside the `setting` package.
3. Create a folder named `stage1_locator` in the root directory of the project for storing locator files.
4. Create a folder in the root directory of the project
5. Create a folder named `original_raweeg` in any accessible path to hold the original (downloaded) EEG datasets.
6. Create a folder named `standard_raweeg` in any accessible path to hold the processed, standardized EEG datasets.
7. Download a dataset and unzip it into `original_raweeg`, for example [BCI Competition IV 2a](https://www.bbci.de/competition/iv/#dataset2a).
8. Define the path variables in `path_variable.py`, for example:
    ```python
    # Input path for each dataset
    bcic_iv_2a_path = r"path/to/original_raweeg/bcic_iv_2a"
    bcic_iv_2b_path = r"path/to/original_raweeg/bcic_iv_2b" # add more path if needed
   
    # output directory for stage 1
    stage1_locator = r"path/to/stage1_locator/" # folder to save stage 1 locator
    stage1_output_dir = r"path/to/standard_raweeg"  # folder to save processed datasets
    ```
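
The folder layout from this step can also be created programmatically. The following is a minimal sketch, assuming a hypothetical helper `create_layout` and a temporary directory standing in for your real project and data locations; none of these names come from EEGUnity.

```python
import tempfile
from pathlib import Path


def create_layout(project_root, data_root):
    """Create the Step 1 folders and files (hypothetical helper)."""
    project_root, data_root = Path(project_root), Path(data_root)

    # `setting` package that will hold path_variable.py
    setting_dir = project_root / "setting"
    setting_dir.mkdir(parents=True, exist_ok=True)
    (setting_dir / "__init__.py").touch()
    path_file = setting_dir / "path_variable.py"
    path_file.touch()

    # Locator files live inside the project
    locator_dir = project_root / "stage1_locator"
    locator_dir.mkdir(exist_ok=True)

    # Dataset folders may live anywhere accessible
    original_dir = data_root / "original_raweeg"
    standard_dir = data_root / "standard_raweeg"
    original_dir.mkdir(parents=True, exist_ok=True)
    standard_dir.mkdir(parents=True, exist_ok=True)

    return [path_file, locator_dir, original_dir, standard_dir]


# A temporary directory stands in for your real locations
with tempfile.TemporaryDirectory() as tmp:
    created = create_layout(Path(tmp) / "standard_script_projects", Path(tmp) / "data")
    created_names = sorted(p.name for p in created)
    all_exist = all(p.exists() for p in created)

print(created_names, all_exist)
```

In a real project you would pass your actual project root and data directory instead of the temporary one, then fill in `path_variable.py` as shown above.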

## Step 2: Make Standard Script for Datasets
Create a Python file named `make_fif_xxxx.py` in the root directory of the project, then build it by following the instructions below.
The script performs the following tasks: 
1. **Parameter Settings**:   
   Define input and output paths, domain tag, and cache usage. For example: 
   ```python 
   # Parameter settings 
   import os
   import setting.path_variable as pv # this is additional python file which stores all path variable
   input_path = pv.bcic_iv_2a_path  # Dataset directory 
   domain_tag = 'bcic-iv-2a'        # Domain tag for marking the dataset 
   output_path = os.path.join(pv.stage1_output_dir, 'bcic_iv_2a')  # Output path 
   use_cache = False 
   ```
   *Note: Do not hard-code absolute paths in the script; keep them in `path_variable.py` and specify only the basic parameters here.* 
2. **Loading the Dataset**:   
   The script uses the `UnifiedDataset` class to load the dataset. If a locator file exists and caching is enabled, it reuses that file; otherwise, it creates a new locator file. 
   ```python
   from eegunity import UnifiedDataset

   # The locator file is named after the dataset folder
   locator_path = os.path.join(pv.stage1_locator, os.path.basename(input_path)+".csv")
   if os.path.exists(locator_path) and use_cache:
       unified_dataset = UnifiedDataset(locator_path=locator_path)
   else:
       unified_dataset = UnifiedDataset(dataset_path=input_path, domain_tag=domain_tag)
       unified_dataset.save_locator(locator_path)
   ```
3. **Processing Each Data Row**:   
   A function `app_func` is defined to: 
   - Load the raw EEG data using `get_data_row`, a key function in `EEGUnity`. 
   - Extract the subject ID (e.g., from the file path) and update the subject information (such as age, gender, etc.) in `mne_raw.info['description']`. For convenience, you can check the dataset folder beforehand and store the information in a Python dictionary to simplify loading it within the code. 
   - Rename channels to match the standard system (e.g., 10-20 system). 
   - **Events Handling**:   
     - **⚡ Extract events from the dataset folder** and convert them back to annotations using `mne.annotations_from_events`.  
     **_Note:_** This step is **the most important** and **time-consuming** when creating `standard script`. 🚨
     For more details on MNE annotations, please read the [MNE-Python Annotations Documentation](https://mne.tools/stable/auto_tutorials/plot_annotations.html) carefully.  
   - Handle irregularities, such as nonstandard data (e.g., file names and event annotations), directly within the script.  
     This ensures that users can run the `standard script` immediately after downloading the datasets without additional modifications.
4. **Saving the Processed Data**:   
   After processing, the EEG data is saved as a FIF file. The output filename is derived from the original one by replacing its extension with `_raw.fif`, which conforms to MNE naming conventions. 
5. **Batch Processing**:   
   Finally, the script applies `app_func` to all data rows (filtered to only those marked as `Completed`), processing the entire dataset in batch. 
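
To make the events-handling step more concrete, the sketch below shows the arithmetic behind turning an events array into annotation entries: each event's sample index is divided by the sampling rate to obtain an onset in seconds, and its integer code is looked up in a description map. This is a plain-Python illustration, not MNE's implementation; the helper `events_to_annotation_rows`, the choice to skip unmapped codes, and the sample values are assumptions made for the example. In the actual script, `mne.annotations_from_events` performs the conversion.

```python
def events_to_annotation_rows(events, sfreq, event_desc, first_samp=0):
    """Convert [sample, previous, code] events into annotation-like dicts.

    Hypothetical helper for illustration only.
    """
    rows = []
    for sample, _previous, code in events:
        if code not in event_desc:
            continue  # this sketch skips codes that have no description
        rows.append({
            "onset": (sample - first_samp) / sfreq,  # seconds from recording start
            "duration": 0.0,                         # events are instantaneous here
            "description": event_desc[code],
        })
    return rows


# Made-up events at a 250 Hz sampling rate
events = [[250, 0, 6], [750, 0, 7], [2250, 0, 6], [2750, 0, 8], [3000, 0, 99]]
event_desc = {
    6: "Start of a trial",
    7: "Cue onset left (class 1)",
    8: "Cue onset right (class 2)",
}

rows = events_to_annotation_rows(events, sfreq=250.0, event_desc=event_desc)
for row in rows:
    print(row)
```

Keeping the description strings identical to those of the original dataset is what makes the resulting annotations comparable across files.
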
## Detailed Explanation of the Code 
Below is the full sample script for standardizing the `bcic_iv_2a` dataset. You can copy it into your project and modify it for your own dataset: 
```python 
import scipy.io as scio
from eegunity import UnifiedDataset
from eegunity import get_data_row
import numpy as np
import os
import mne
import setting.path_variable as pv
import json

subject_dict = {
    'A01': {'gender': 'female', 'age': 22},
    'A02': {'gender': 'female', 'age': 24},
    'A03': {'gender': 'male', 'age': 26},
    'A04': {'gender': 'female', 'age': 24},
    'A05': {'gender': 'male', 'age': 24},
    'A06': {'gender': 'female', 'age': 23},
    'A07': {'gender': 'male', 'age': 25},
    'A08': {'gender': 'male', 'age': 23},
    'A09': {'gender': 'male', 'age': 17}
}

# Parameter settings
input_path = pv.bcic_iv_2a_path  # Dataset directory
domain_tag = "bcic-iv-2a"      # Domain tag for marking the dataset
output_path = os.path.join(pv.stage1_output_dir, "bcic_iv_2a")  # Output path
use_cache = False

def app_func(row, output_dir):
    # Load the MNE raw data
    mne_raw = get_data_row(row)
    subject_id = os.path.basename(row['File Path'])[:3]
    description_dict = {
        "original_description": mne_raw.info['description'],
        "eegunity_description": {
            "amplifier": "unknown",
            "cap": "Ag/AgCl",
            "age": subject_dict[subject_id]['age'],
            "sex": subject_dict[subject_id]['gender'],
            "handedness": "unknown"
        }
    }
    mne_raw.info['description'] = json.dumps(description_dict)

    # Rename the channels to their standard 10-20 system names
    mne_raw.rename_channels({
        'EEG-Fz': 'Fz', 'EEG-0': 'FC3', 'EEG-1': 'FC1', 'EEG-2': 'FCz',
        'EEG-3': 'FC2', 'EEG-4': 'FC4', 'EEG-5': 'C5', 'EEG-C3': 'C3',
        'EEG-6': 'C1', 'EEG-Cz': 'Cz', 'EEG-7': 'C2', 'EEG-C4': 'C4',
        'EEG-8': 'C6', 'EEG-9': 'CP3', 'EEG-10': 'CP1', 'EEG-11': 'CPz',
        'EEG-12': 'CP2', 'EEG-13': 'CP4', 'EEG-14': 'P1', 'EEG-Pz': 'Pz',
        'EEG-15': 'P2', 'EEG-16': 'POz'
    })

    # Mark the EOG channels first so that only EEG channels receive montage positions
    mne_raw.set_channel_types({'EOG-left': 'eog', 'EOG-central': 'eog', 'EOG-right': 'eog'})
    montage = mne.channels.make_standard_montage('standard_1020')
    mne_raw.set_montage(montage, on_missing='ignore')

    # Event ID mapping
    event_id = {
        'Rejected trial': 1,
        'Eye movements': 2,
        'Idling EEG (eyes open)': 3,
        'Idling EEG (eyes closed)': 4,
        'Start of a new run': 5,
        'Start of a trial': 6,
        'Cue onset left (class 1)': 7,
        'Cue onset right (class 2)': 8,
        'Cue onset foot (class 3)': 9,
        'Cue onset tongue (class 4)': 10
    }

    # Extract events and original event IDs
    events, original_event_id = mne.events_from_annotations(mne_raw)

    # Update events based on new event_id mapping
    for event_desc, new_id in event_id.items():
        if event_desc in original_event_id:
            events[events[:, 2] == original_event_id[event_desc], 2] = new_id

    # Check if the file name ends with 'E' and construct the .mat file path
    file_base, file_ext = os.path.splitext(row['File Path'])
    if file_base.endswith('E') and file_ext == '.gdf':
        mat_filepath = f"{file_base}.mat"
        if os.path.exists(mat_filepath):
            mat_data = scio.loadmat(mat_filepath)
            # Class labels 1-4 are shifted by 6 to match event IDs 7-10 in `event_id`
            values_from_mat = mat_data['classlabel'].flatten() + 6

            # Replace events whose ID (last column) is 7 with the labels from the .mat file
            replacement_indices = np.where(events[:, -1] == 7)[0]
            if len(replacement_indices) >= len(values_from_mat):
                events[replacement_indices[:len(values_from_mat)], 2] = values_from_mat
            else:
                print(f"Warning: {mat_filepath} contains more labels than there are events to replace.")

    # Convert modified events back to annotations
    event_desc = {value: key for key, value in event_id.items()}  # Convert event IDs back to descriptions
    annotations = mne.annotations_from_events(
        events=events,
        sfreq=mne_raw.info['sfreq'],
        event_desc=event_desc  # Mapping event codes to descriptions
    )
    mne_raw.set_annotations(annotations)  # Set new annotations to raw data

    # Save the processed EEG data to the output directory
    os.makedirs(output_dir, exist_ok=True)  # Make sure the output directory exists
    filename = os.path.basename(row['File Path'])  # Extract the file name
    # Replace the original extension (e.g. '.gdf') with '_raw.fif' to conform to MNE naming conventions
    output_filename = f"{os.path.splitext(filename)[0]}_raw.fif"
    output_file = os.path.join(output_dir, output_filename)  # Full path of the output file
    mne_raw.save(output_file, overwrite=True)  # Save the file, overwriting if necessary

    return None


# 1. Load the dataset directory
locator_path = os.path.join(pv.stage1_locator, os.path.basename(input_path)+".csv")
if os.path.exists(locator_path) and use_cache:
    unified_dataset = UnifiedDataset(locator_path=locator_path)
else:
    unified_dataset = UnifiedDataset(dataset_path=input_path, domain_tag=domain_tag)
    unified_dataset.save_locator(locator_path)

# 2. Batch process EEG data
unified_dataset.eeg_batch.batch_process(
    con_func=lambda row: row['Completeness Check'] == 'Completed',  # Keep only rows marked 'Completed'
    app_func=lambda row: app_func(row, output_path),  # Call the processing function and set output directory
    is_patch=False,  # No patching needed
    result_type=None  # No return type needed
)
```
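
The `con_func`/`app_func` pair passed to `batch_process` follows a filter-then-apply pattern. The sketch below mimics that pattern on a plain list of dictionaries so the control flow is easy to see. `run_batch`, the sample rows, and the `'Unavailable'` status value are hypothetical stand-ins and say nothing about how EEGUnity implements `batch_process` internally.

```python
def run_batch(rows, con_func, app_func):
    """Apply app_func to every row for which con_func returns True.

    Hypothetical stand-in used only to illustrate the pattern.
    """
    results = []
    for row in rows:
        if con_func(row):
            results.append(app_func(row))
    return results


# Made-up locator rows; a real locator holds many more columns
rows = [
    {"File Path": "A01T.gdf", "Completeness Check": "Completed"},
    {"File Path": "A01E.gdf", "Completeness Check": "Unavailable"},
    {"File Path": "A02T.gdf", "Completeness Check": "Completed"},
]

processed = run_batch(
    rows,
    con_func=lambda row: row["Completeness Check"] == "Completed",
    app_func=lambda row: row["File Path"],
)
print(processed)
```

Wrapping `app_func` in a lambda, as the sample script does, is a simple way to pass extra arguments such as the output directory to a function that otherwise receives only the row.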

## Key Points to Remember 
- **Annotations**: All events are stored as MNE annotations, ensuring consistency in event descriptions. For more details, refer to [MNE-Python Annotations](https://mne.tools/stable/auto_tutorials/plot_annotations.html). 
- **Subject Information**: Subject details (e.g., age, gender) are embedded in the `info['description']` field. 
- **Channel Information**: Renaming channels and setting a standard montage guarantees accurate electrode positioning. 
- **Robustness**: Irregularities (e.g., nonstandard file names and event annotations) are handled inside the script so that users can process the dataset without manually modifying the downloaded directory. 
- **Parameter Settings**: Only basic parameters (`input_path`, `domain_tag`, `output_path`, `use_cache`) need to be specified, and absolute paths are kept out of the script.
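
Because subject information is stored as a JSON string, downstream code can recover it with `json.loads`. The following is a minimal sketch, assuming the same `original_description`/`eegunity_description` layout used in the sample script. The helper names `build_description` and `read_subject_info` are hypothetical, and in practice the string would come from `raw.info['description']` of a loaded FIF file.

```python
import json


def build_description(original_description, age, sex):
    """Pack subject information into a JSON string (hypothetical helper)."""
    return json.dumps({
        "original_description": original_description,
        "eegunity_description": {
            "amplifier": "unknown",
            "cap": "Ag/AgCl",
            "age": age,
            "sex": sex,
            "handedness": "unknown",
        },
    })


def read_subject_info(description):
    """Recover the subject information dictionary from the JSON string."""
    return json.loads(description)["eegunity_description"]


# Round trip with the values of subject A01 from the sample script
description = build_description(None, age=22, sex="female")
info = read_subject_info(description)
print(info["age"], info["sex"], info["handedness"])
```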